Deep Learning NYU, Week 5
- Gradient Descent
- Worst optimization method in the world
- Optimization problem
- minimize f(w) over w
- $w_{k+1} = w_k - \gamma_k \nabla f(w_k)$, where $\gamma_k$ is the step size
- Assumes f is continuous and differentiable – not true for typical networks (e.g. ReLU)
- actually only sub-differentiable
- "It should work; no theory to support this"
- Follow the direction of the negative gradient
- we look at the optimization landscape locally
- landscape = the loss surface over the domain of all the weights in the network
- find the best solution relative to where we are
- Consider a quadratic optimization problem
- positive definite case
- the gradient there is just the matrix times the distance from the solution
- with a good step size, the distance to the solution shrinks each step by a factor of roughly 1 - smallest eigenvalue / largest eigenvalue
- largest eigenvalue / smallest eigenvalue = condition number
- poorly conditioned – the condition number is very large; well conditioned – it is close to 1
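A minimal NumPy sketch of the gradient-descent update on a toy positive-definite quadratic (my own example, not from the lecture): the gradient is the matrix times the distance from the solution, and the eigenvalues control how fast the error shrinks.

```python
import numpy as np

# Toy quadratic: f(w) = 0.5 * w @ A @ w, minimum at w = 0.
# The gradient is A @ w, i.e. the matrix times the distance from the solution.
A = np.diag([1.0, 10.0])          # eigenvalues 1 and 10 -> condition number 10
w = np.array([1.0, 1.0])

step = 2.0 / (1.0 + 10.0)         # classical fixed step 2 / (lambda_min + lambda_max)
for k in range(50):
    grad = A @ w                  # gradient at w_k
    w = w - step * grad           # w_{k+1} = w_k - step * gradient

print(w)                          # close to the solution [0, 0]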
- Step sizes
- we don't have a good estimate of learning rate
- try a bunch of values on the log scale
- ideally choose an optimal step size
- we tend to choose the largest possible learning rate – at the edge of divergence
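One way to do the log-scale search, sketched with a stand-in final_loss function (a toy quadratic rather than a real training run, so this is only illustrative):

```python
import numpy as np

def final_loss(lr, steps=100):
    """Stand-in training run: gradient descent on a toy quadratic, returns the final loss."""
    A = np.diag([1.0, 10.0])
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * (A @ w)
    return 0.5 * w @ A @ w

# Try learning rates on a log scale; keep the largest one that doesn't diverge.
for lr in np.logspace(-4, 0, num=5):      # 1e-4, 1e-3, 1e-2, 1e-1, 1e0
    print(f"lr={lr:.0e}  final loss={final_loss(lr):.3e}")
```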
- Stochastic optimization
- Actually used to train nets in practice
- Replace gradient with a stochastic approximation to the gradient
- Gradient of the loss for a single instance
- instance chosen uniformly at random
- (the full loss is the sum of the per-instance losses f_i)
- the expected value of the SGD step direction is the full gradient
- useful to think of it as gd with noise
- Annealing
- neural network landscapes are bumpy
- the noise in SGD, in particular, helps it jump over these small bad minima
- good minima are larger and harder to skip
- Also valuable because
- we have a lot of redundancy
- SGD exploits the redundancy
- so each update can be thousands of times cheaper than full GD
- which makes it hard to justify using full GD instead
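A minimal NumPy sketch of SGD on a least-squares loss (toy data of my own, for illustration): the full loss is the sum of per-instance losses f_i, and each step uses the gradient of one instance chosen uniformly at random.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                       # toy inputs
y = X @ np.array([1., -2., 0.5, 3., 0.]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.01
for step in range(10_000):
    i = rng.integers(len(X))                         # instance chosen uniformly at random
    grad_i = (X[i] @ w - y[i]) * X[i]                # gradient of f_i(w) = 0.5 * (x_i.w - y_i)^2
    w -= lr * grad_i                                 # in expectation this is the full gradient

print(np.round(w, 2))                                # near the true weights, up to SGD noise
```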
- Minibatching
- use batches randomly chosen
- practical reasons are overwhelming
- much more efficient utilization of hardware
- e.g. ImageNet training uses batch sizes of 64
- distributed training
- "ImageNet in one hour"
- Full batch
- do not use gradient descent
- LBFGS
- 50 years of optim research
- scipy has a bulletproof implementation
- on CPU, batch size isn't as critical for hardware utilization
- Always try mini-batching
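For a genuinely deterministic, full-batch problem, the scipy implementation mentioned above can be used like this; the Rosenbrock function is just a stand-in objective with a known gradient.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS on a deterministic objective (Rosenbrock as a stand-in), with its exact gradient.
x0 = np.zeros(5)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
print(res.x)          # approximately [1, 1, 1, 1, 1], the minimizer
print(res.nit)        # number of iterations used
```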
- Momentum
- trick to always use with SGD
- adds a momentum parameter (β) to the update rule
- $w_{k+1} = w_k - \gamma_k \nabla f(w_k) + \beta_k (w_k - w_{k-1})$
- equivalent form: update both p and w – damp the old momentum and add the new gradient (sketched in code at the end of this section)
- p is an accumulated gradient buffer – past gradients are exponentially damped – a running sum of gradients
- SGD with momentum just uses the stochastic gradient in place of the full gradient
- "Stochastic heavy ball method"
- momentum keeps pushing the update in the same direction instead of making dramatic changes of direction
- small beta – can change direction more quickly; high beta makes it harder to turn
- high beta helps dampen oscillations
- β = 0.9 or 0.99 almost always works well
- momentum also effectively increases the step size (applied to the accumulated past gradients)
- the effective step size is scaled by roughly 1/(1 - β), so the learning rate may need to be reduced to compensate
- why it works
- acceleration contributes to performance
- Nesterov did a lot of the research on acceleration
- Acceleration
- Noise smoothing
- momentum averages the gradients over time
- this smoothing means the averaged iterates become a good approximation to the solution
- reduces the bouncing around
- plain SGD already works well when the problem is well conditioned
- the smoothing matters most when it is poorly conditioned
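A minimal NumPy sketch of the momentum update in the p/w buffer form above, on the same kind of toy quadratic as before; this is the same rule torch.optim.SGD implements when you pass momentum=0.9.

```python
import numpy as np

A = np.diag([1.0, 10.0])              # toy positive-definite quadratic
w = np.array([1.0, 1.0])
p = np.zeros_like(w)                  # accumulated gradient buffer

gamma, beta = 0.1, 0.9
for k in range(200):
    grad = A @ w                      # a stochastic gradient would go here for SGD
    p = beta * p + grad               # damp the old momentum, add the new gradient
    w = w - gamma * p                 # accumulated gradients scale the step by ~1/(1 - beta)

print(w)                              # close to the solution [0, 0]
```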
- Adaptive methods
- maintain a separate learning-rate estimate for each weight
- lots of different ways to do this
- smaller learning rates for weights later in the network, larger in the early weights
- fairly hand-wavy
- RMSProp
- normalize the update by the root mean square of recent gradients (an exponential moving average of the squared gradients)
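The standard RMSProp update, which the missing figure here presumably showed, sketched as a single step (alpha is the moving-average constant, eps avoids division by zero):

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=1e-3, alpha=0.99, eps=1e-8):
    """One RMSProp step: normalize the update by the RMS of recent gradients."""
    v = alpha * v + (1 - alpha) * grad**2        # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(v) + eps)       # per-weight normalized step
    return w, v
```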
- ADAM: Adaptive moment estimation
- bias correction in full Adam boosts the moment estimates during the early steps, when the running averages are still near zero
- Occasionally doesn't converge
- Poorly understood
- Has worse generalization error
- Small neural networks will have different results depending on initial values
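The standard Adam update with bias correction, sketched the same way (in place of the missing figure above); dividing by (1 - β^t) is what boosts the moment estimates in the early steps:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: RMSProp-style normalization plus momentum, with bias correction."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad**2        # second moment (RMS)
    m_hat = m / (1 - beta1**t)                   # bias correction, t starts at 1
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```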
- Normalization layers
- Linear -> norm -> activation or
- Conv -> norm -> ReLu
- They don't make the network more powerful
- a whitening (standardizing) operation applied to the activations
- with some additional parameters so the outputs can still take any range of values
- adds more parameters to the layer: learnable scaling and bias term
- y = a / stddev * (x - mean) + b
- often they reverse the parametrization
- a & b move slowly as they're learned
- Batch norm
- bizarre, but works very well
- normalize across batch
- estimates mean and stddev across all instances in a mini batch
- breaks the assumptions behind SGD theory, since instances within a batch are no longer processed independently
- layer norm, instance norm, and group norm are other normalizations that work
- group norm works where batch norm works
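A sketch of the y = a / stddev * (x - mean) + b operation for batch norm, with the mean and std estimated across the instances of the mini-batch; roughly speaking, which axes the statistics are computed over is what distinguishes batch, layer, instance, and group norm.

```python
import numpy as np

def batch_norm_forward(x, a, b, eps=1e-5):
    """x: (batch, features). Normalize each feature over the mini-batch, then scale and shift."""
    mean = x.mean(axis=0)                        # per-feature mean over the mini-batch
    std = x.std(axis=0)                          # per-feature std over the mini-batch
    return a * (x - mean) / (std + eps) + b      # y = a / std * (x - mean) + b
```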
- Why does normalization help?
- the network becomes easier to optimize, so larger learning rates can be used
- adds noise, which helps with generalization
- makes weight initialization less important
- allows plugging together multiple layers with impunity
- allows for automated architecture search
- (without normalization, stacking arbitrary layers typically resulted in a poorly conditioned network)
- have to backpropagate through the calculation of the mean and stddev
- for batch/instance norm: mean/std are fixed after training
- group/layer can update the values
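The Conv -> norm -> ReLU pattern sketched in PyTorch; after training, calling .eval() is what makes batch norm switch from batch statistics to its fixed running mean/std.

```python
import torch
from torch import nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),  # bias is redundant before a norm
    nn.BatchNorm2d(16),                                       # learnable scale (a) and shift (b)
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)       # mini-batch of 8 images
y = block(x)                        # training mode: normalizes with batch statistics

block.eval()                        # inference: uses the fixed running mean/std instead
y = block(x)
```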
- Death of optimization
- try to use a big neural network to solve the optimization problem
- Practicum
- Convolution output dimensions: with stride 1 and no padding, an input of length n and a kernel of size k give an output of length n - k + 1; with m kernels the output is (n - k + 1) by m
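A quick check of the formula with PyTorch, reading m as the number of kernels (output channels), which is an assumption on my part:

```python
import torch
from torch import nn

n, k, m = 100, 5, 8
conv = nn.Conv1d(in_channels=1, out_channels=m, kernel_size=k)   # stride 1, no padding
x = torch.randn(1, 1, n)            # (batch, channels, length)
y = conv(x)
print(y.shape)                      # torch.Size([1, 8, 96]): m channels of length n - k + 1
```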